The Annals of Statistics 

2012, Vol. 40, No. 4, 2014-2042 

DOI: 10.1214/12-AOS999 

© Institute of Mathematical Statistics. 2012 



ADAPTIVE COVARIANCE MATRIX ESTIMATION THROUGH 
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By T. Tony Cai 1 and Ming Yuan 2 

University of Pennsylvania and Georgia Institute of Technology 

Estimation of large covariance matrices has drawn considerable 
recent attention, and the theoretical focus so far has mainly been on 
developing a minimax theory over a fixed parameter space. In this 
paper, we consider adaptive covariance matrix estimation where the 
goal is to construct a single procedure which is minimax rate opti- 
mal simultaneously over each parameter space in a large collection. 
A fully data-driven block thresholding estimator is proposed. The 
estimator is constructed by carefully dividing the sample covariance 
matrix into blocks and then simultaneously estimating the entries in 
a block by thresholding. The estimator is shown to be optimally rate 
adaptive over a wide range of bandable covariance matrices. A sim- 
ulation study is carried out and shows that the block thresholding 
estimator performs well numerically. Some of the technical tools de- 
veloped in this paper can also be of independent interest. 

1. Introduction. Covariance matrix estimation is of fundamental impor- 
tance in multivariate analysis. Driven by a wide range of applications in 
science and engineering, the high-dimensional setting, where the dimension 
p can be much larger than the sample size n, is of particular current inter- 
est. In such a setting, conventional methods and results based on fixed p 
and large n are no longer applicable, and in particular, the commonly used 
sample covariance matrix and normal maximum likelihood estimate perform 
poorly. 

A number of regularization methods, including banding, tapering, thresh- 
olding and i\ minimization, have been developed in recent years for estimat- 
ing a large covariance matrix or its inverse. See, for example, Ledoit and Wolf 
(2004), Huang et al. (2006), Yuan and Lin (2007), Banerjee, El Ghaoui and 
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d'Aspremont (2008), Bickel and Levina (2008a, 2008b), El Karoui (2008), 
Fan, Fan and Lv (2008), Friedman, Hastie and Tibshirani (2008), Rocha, 
Zhao and Yu (2008), Rothman et al. (2008), Lam and Fan (2009), Roth- 
man, Levina and Zhu (2009), Cai, Zhang and Zhou (2010), Yuan (2010), 
Cai and Liu (2011) and Cai, Liu and Luo (2011), among many others. 

Let X^\ . . . ,X^ be n independent copies of a p dimensional Gaussian 
random vector X = (X\, . . . ,X p ) T ~ N(/j, £). The goal is to estimate the 
covariance matrix £ and its inverse £ _1 based on the sample {JW :i = 
1, . . . ,n}. It is now well known that the usual sample covariance matrix 



n 



n 

— Y^{X^-X){X^-X)\ 



where X = — Y™ , X^' , is not a consistent estimator of the covariance ma- 
trix £ when p> n, and structural assumptions are required in order to 
estimate £ consistently. 

One of the most commonly considered classes of covariance matrices is the 
"bandable" matrices, where the entries of the matrix decay as they move 
away from the diagonal. More specifically, consider the following class of 
covariance matrices introduced in Bickel and Levina (2008a): 

C a = C a (M , M) := |s : max^l^l : \i - j\ > k} < Mk~ a Vfc, 

(1.1) 3 * 

and < Mq- 1 < A min (E), A max (S) < M 



Such a family of covariance matrices naturally arises in a number of set- 
tings, including temporal or spatial data analysis. See Bickel and Levina 
(2008a) for further discussions. Several regularization methods have been 
introduced for estimating a bandable covariance matrix £ G C a . Bickel and 
Levina (2008a) suggested banding the sample covariance matrix £ and es- 
timating £ by £ o Bk where is a banding matrix 

B k = (K\i-j\<k)) 1 < i , j < p 

and o represents the Schur product, that is, (A o B)ij = AijBij for two ma- 
trices of the same dimensions. See Figure 1(a) for an illustration. Bickel and 
Levina (2008a) proposed to choose k x (n/logp) 1 ^ 2 ^ a+1 ^ and showed that 
the resulting banding estimator attains the rate of convergence 

q/(2o+2)n 

(1.2) ||£o£ fc -£||=O p ( ' 



logp 



uniformly over C a , where || • || stands for the spectral norm. This result 
indicates that even when p> n, it is still possible to consistently estimate 
£ G C a , so long as logp = o(n). 
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(a) Weighting matrix for banding (b) Weighting matrix for tapering 



Fig. 1. Both banding and tapering estimators can be expressed as the Schur product of 
the sample covariance matrix and a weighting matrix. Subfigures of (a) and (b) illustrate 
the weighting matrix for both estimators. 

Cai, Zhang and Zhou (2010) established the minimax rate of convergence 
for estimation over C a and introduced a tapering estimator EoTj where the 
tapering matrix is given by 

T k = (\{{k - \i-j\) + - (k/2 - \i - 

\ K / l<i,j<p 

with (x)+ = max(x,0). See Figure 1(b) for an illustration. It was shown 
that the tapering estimator EoT^ with k x fi 1 /( 2 "+ 1 ) achieves the rate of 
convergence 

(1.3) ||S o T k - E|| = O p [n- a '^ + {^f\ 

uniformly over C a , which is always faster than the rate in (1.2). This implies 
that the rate of convergence given in (1.2) for the banding estimator with 
k x (n/logp) 1 ^ 2 ^ a+1 ^ is in fact sub-optimal. Furthermore, a lower bound 
argument was given in Cai, Zhang and Zhou (2010) which showed that 
the rate of convergence given in (1.3) is indeed optimal for estimating the 
covariance matrices over C a . 

The minimax rate of convergence in (1.3) provides an important bench- 
mark for the evaluation of the performance of covariance matrix estimators. 
It is, however, evident from its construction that the rate optimal taper- 
ing estimator constructed in Cai, Zhang and Zhou (2010) requires explicit 
knowledge of the decay rate a which is typically unknown in practice. It 
is also clear that a tapering estimator designed for a parameter space with 
a given decay rate a performs poorly over another parameter space with 
a different decay rate. The tapering estimator mentioned above is thus not 
very practical. 
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This naturally leads to the important question of adaptive estimation: 
Is it possible to construct a single estimator, not depending on the decay 
rate a, that achieves the optimal rate of convergence simultaneously over a 
wide range of the parameter spaces C a ? We shall show in this paper that the 
answer is affirmative. A fully data-driven adaptive estimator £ is constructed 
and is shown to be simultaneously rate optimal over the collection of the 
parameter spaces C a for all a > 0. That is, 



In many applications, the inverse covariance matrix is of significant interest. 
We introduce a slightly modified version of S _1 and show that it adaptively 
attains the optimal rate of convergence for estimating S" 1 . 

The adaptive covariance matrix estimator achieves its adaptivity through 
block thresholding of the sample covariance matrix E. The idea of adaptive 
estimation via block thresholding can be traced back to nonparametric func- 
tion estimation using Fourier or wavelet series. See, for example, Efromovich 
(1985) and Cai (1999). However, the application of block thresholding to co- 
variance matrix estimation poses new challenges. One of the main difficulties 
in dealing with covariance matrix estimation, as opposed to function esti- 
mation or sequence estimation problems, is the fact that the spectral norm 
is not separable in its entries. Another practical challenge is due to the fact 
that the covariance matrix is "two-directional" where one direction is along 
the rows and another along the columns. The blocks of different sizes need to 
be carefully constructed so that they fit well in the sample covariance matrix 
and the risk can be assessed based on their joint effects rather than their 
individual contributions. There are two main steps in the construction of the 
adaptive covariance matrix estimator. The first step is the construction of 
the blocks. Once the blocks are constructed, the second step is to estimate 
the entries of the covariance matrix £ in groups and make simultaneous 
decisions on all the entries within a block. This is done by thresholding the 
sample covariance matrix block by block. The threshold level is determined 
by the location, block size and corresponding spectral norms. The detailed 
construction is given in Section 2. 

We shall show that the proposed block thresholding estimator £ is simul- 
taneously rate-optimal over every C a for all o. ^> 0. The theoretical analysis 
of the estimator £ requires some new technical tools that can be of indepen- 
dent interest. One is a concentration inequality which shows that although 
the sample covariance matrix £ is not a reliable estimator of S, its subma- 
trices could still be a good estimate of the corresponding submatrices of S. 
Another useful tool is a so-called norm compression inequality which reduces 
the analysis on the whole matrix to a matrix of much smaller dimensions, 
whose entries are the spectral norms of the blocks. 




for all q > 0. 
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In addition to the analysis of the theoretical properties of the proposed 
adaptive block thresholding estimator, a simulation study is carried out to 
investigate the finite sample performance of the estimator. The simulations 
show that the proposed estimator enjoys good numerical performance when 
compared with nonadaptive estimators such as the banding and tapering 
estimators. 

Besides bandable matrices considered in the present paper, estimating 
sparse covariance matrices and sparse precision matrices has also been ac- 
tively studied in the recent literature. Bickel and Levina (2008b) proposed 
a thresholding estimator for sparse covariance matrices and obtained the 
rate of convergence. Cai and Zhou (2011) developed a new general mini- 
max lower bound technique and established the minimax rate of conver- 
gence for estimating sparse covariance matrices under the spectral norm 
and other matrix operator norms. Cai and Liu (2011) introduced an adap- 
tive thresholding procedure for estimating sparse covariance matrices that 
automatically adjusts to the variability of individual entries. Estimation of 
sparse precision matrices has also drawn considerable attention due to its 
close connections to Gaussian graphical model selection. See Yuan and Lin 
(2007), Yuan (2010), Ravikumar et al. (2011) and Cai, Liu and Luo (2011). 
The optimal rate of convergence for estimating sparse inverse covariance 
matrices was established in Cai, Liu and Zhou (2011). 

The rest of the paper is organized as follows. Section 2 presents a detailed 
construction of the data-driven block thresholding estimator £. The theo- 
retical properties of the estimator are investigated in Section 3. It is shown 
that the estimator £ achieves the optimal rate of convergence simultaneously 
over each C a (Mo,M) for all a,Mo,M > 0. In addition, it is also shown that 
a slightly modified version of X -1 is adaptively rate-optimal for estimating 
X -1 over the collection C a (Mo,M). Simulation studies are carried out to 
illustrate the merits of the proposed method, and the numerical results are 
presented in Section 4. Section 5 discusses extension to subguassian noise, 
adaptive estimation under the Frobenius norm and other related issues. The 
proofs of the main results are given in Section 6. 

2. Block thresholding. In this section we present in detail the construc- 
tion of the adaptive covariance matrix estimator. The main strategy in the 
construction is to divide the sample covariance matrix into blocks and then 
apply thresholding to each block according to their sizes and dimensions. 
We shall explain these two steps separately in Sections 2.1 and 2.2. 

2.1. Construction of blocks. As mentioned in the Introduction, the appli- 
cation of block thresholding to covariance matrix estimation requires more 
care than in the conventional sequence estimation problems such as those 
from nonparametric function estimation. We begin with the blocking scheme 
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Fig. 2. Construction of blocks with increasing dimensions away from the diagonal. The 
solid black blocks are of size ko x ko . The gray ones are of size 2ko x 2ko . 

for a general p x p symmetric matrix. A key in our construction is to make 
blocks larger for entries that are farther away from the diagonal and take 
advantage of the approximately banding structure of the covariance matrices 
in C a . Before we give a precise description of the construction of the blocks, 
it is helpful to graphically illustrate the construction in the following plot. 

Due to the symmetry, we shall focus only on the upper half for brevity. We 
start by constructing blocks of size ko x ko along the diagonal as indicated by 
the darkest squares in Figure 2. Note that the last block may be of a smaller 
size if ko is not a divisor of p. Next, new blocks are created successively 
toward the top right corner. We would like to increase the block sizes along 
the way. To this end, we extend to the right from the diagonal blocks by 
either two or one block of the same dimensions (ko x ko) in an alternating 
fashion. After this step, as exhibited in Figure 2, the odd rows of blocks will 
have three ko x ko blocks, and the even rows will have two ko x ko in the 
upper half. Next, the size of new blocks is doubled to 2ko x 2ko- Similarly to 
before, the last block may be of smaller size if 2ko is not a divisor of p, and 
for the most part, we shall neglect such a caveat hereafter for brevity. The 
same procedure is then followed. We extend to the right again by three or 
two blocks of the size 2ko x 2ko. Afterwards, the block size is again enlarged 
to 2 2 ko x 2 2 ko and we extend to the right by three or two blocks of size 
2 2 ko x 2 2 ko- This procedure will continue until the whole upper half of the 
px p matrix is covered. For the lower half, the same construction is followed 
to yield a symmetric blocking of the whole matrix. 

The initial block size ko can take any value as long as ko >c log p. In 
particular, we can take ko = [logp\ • The specific choice of ko does not impact 
the rate of convergence, but in practice it may be beneficial sometimes to 
use a value different from [logpj • I n what follows, we shall keep using ko for 
the sake of generality. 
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For notational purposes, hereafter we shall refer to the collection of index 
sets for the blocks created in this fashion as B = {B\, . . . , Bn} where B k = 
Ik x Jfc for some subintervals Ik, «/& C {1, . . . ,p}. It is clear that B forms a 
partition of {1,2,... ,p} 2 , that is, 

B kl DB k2 =0 iffci^fc 2 and B x U B 2 U • • • U B N = {1, 2, . . . ,p} 2 . 

For a p x p matrix A = (aij)i<i,j< p and an index set B = I x J £ i3, we 
shall also write Ab = (aij)i^ij^j, a |/| x \ J\ submatrix of A. Hence A is 
uniquely determined by {Ab '■ B £ B} and the partition B. With slight abuse 
of notation, we shall also refer to an index set B as a block when no confusion 
occurs, for the sake of brevity. 

Denote by d(B) the dimension of B, that is, 

d(B) = max{card(I), card(J)}. 

Clearly by construction, most of the blocks in B are necessarily square in 
that card(I) = card(J) = d(B). The exceptions occur when the block sizes 
are not divisors of p, which leaves the blocks along the last row and column 
in rectangles rather than squares. We opt for the more general definition of 
d(B) to account for these rectangle blocks. 

2.2. Block thresholding. Once the blocks are constructed, the next step 
is to estimate the entries of the covariance matrix £, block by block, through 
thresholding the corresponding blocks of the sample covariance matrix based 
on the location, block size and corresponding spectral norms. 

We now describe the procedure in detail. Denote by £ the block thresh- 
olding estimator, and let B = I x J £ B. The estimate of the block is 
defined as follows: 

(a) keep the diagonal blocks: = S# if B is on the diagonal, that is, 
I=J; 

(b) "kill" the large blocks: t, B = if d(B) > n/logn; 

(c) threshold the intermediate blocks: For all other blocks B, set 

(2.1) E B = T Ao (S B ) = t B • IM|E B || > A yj ||Ej x j|| ||S Jx j 

where Ao > is a turning parameter. Our theoretical development indicates 
that the resulting block thresholding estimator is optimally rate adaptive 
whenever Ao is a sufficiently large constant. In particular, it can be taken 
as fixed at Ao = 6. In practice, a data-driven choice of Ao could potentially 
lead to further improved finite sample performance. 

It is clear from the construction that the block thresholding estimate £ is 
fully data-driven and does not require the knowledge of a. The choice of the 
thresholding constant Ao comes from our theoretical and numerical studies. 
See Section 5 for more discussions on the choice of Aq. 



d{B)+logp 



n 
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We should also note that, instead of the hard thresholding operator T\ , 
more general thresholding rules can also be applied in a similar blockwise 
fashion. In particular, one can use block thresholding rules T\ (£b) = £_b • 



and t\ B is a univariate thresholding rule. Typical examples include the soft 
thresholding rule t\ B (z) = {\z\ — Xb)+ sgn(z) and the so-called adaptive lasso 
rule t\ B (z) = z(l — \Xb/ z\ ri ) + for some rj > 1, among others. Rothman, Lev- 
ina and Zhu (2009) considered entrywise universal thresholding for estimat- 
ing sparse covariance matrix. In particular, they investigate the class of 
univariate thresholding rules t\ B such that (a) |i^, (^r) | < \z\; (b) t\ B (z) = 
if \z\ < Xb] and (c) \t\ B {z) — z\ < Xb- Although we will focus on the hard 
thresholding rule in the present paper for brevity, all the theoretical results 
developed here apply to the more general class of block thresholding rules 
as well. 

3. Adaptivity. We now study the properties of the proposed block thresh- 
olding estimator £ and show that the estimator simultaneously achieves the 
minimax optimal rate of convergence over the full range of C a for all a > 0. 
More specifically, we have the following result. 

Theorem 3.1. Let £ be the block thresholding estimator o/£ as defined 
in the Section 2. Then 



£eC Q (M ,M) I re nj 

for all a > 0, where C is a positive constant not depending on n and p. 

Comparing with the minimax rate of convergence given in Cai, Zhang and 
Zhou (2010), this shows that the block thresholding estimator £ is optimally 
rate adaptive over C a for all a > 0. 

Remark 1 . The block thresholding estimator £ is positive definite with 
high probability, but it is not guaranteed to be positive definite. A simple 
additional step, as was done in Cai and Zhou (2011), can make the final 
estimator positive semi-definite and still achieve the optimal rate of con- 
vergence. Write the eigen-decomposition of £ as £ = Y^,=\ ^i v i v J ■> where 
Aj's and Uj's are, respectively, the eigenvalues and eigenvectors of £. Let 
= max(Aj,0) be the positive part of X\, and define 



*a s (||£b||) where 




(3.1) 



sup 




V 



i=l 
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Then £ + is positive semi-definite, and it can be shown easily that £ + attains 
the same rate as £. See Cai and Zhou (2011) for further details. If a strictly 
positive definite estimator is desired, one can also set Xf = max(Aj,e n ) for 
some small positive value e n , say e n = 0(\ogp/n), and the resulting estima- 
tor £ + is then positive definite and attains the optimal rate of convergence. 

The inverse of the covariance matrix, := £ _1 , is of significant interest in 
many applications. An adaptive estimator of f2 can also be constructed based 
on our proposed block thresholding estimator. To this end, let £ = UDU T 
be its eigen-decomposition, that is, U is an orthogonal matrix, and D is a 
diagonal matrix. We propose to estimate fi by 

^ = {7diag(min{4 1 ,n})f7 T , 

where da is the ith diagonal element of D. The truncation of d^ 1 is needed 
to deal with the case where £ is near singular. The result presented above 
regarding £ can be used to show that Ct adaptively achieves the optimal 
rate of convergence for estimating Q. 

Theorem 3.2. Let be defined as above. Then 

sup Ellft - 0|| 2 < Cmin(n- 2Q /( 2a+1 ) + ^1,1] 
EGC Q I n n) 

for all a > 0, where C > is a constant not depending on n and p. 

The proof of the adaptivity results is somewhat involved and requires 
some new technical tools. The main ideas in the theoretical analysis can be 
summarized as follows: 

• The different £ — £ can be decomposed into a sum of matrices such that 
each matrix in the sum only consists of blocks in B that are of the same 
size. The individual components in the sum are then bounded separately 
according to their block sizes. 

• Although the sample covariance matrix £ is not a reliable estimator of 
£, its submatrix, £#, could still be a good estimate of £#• This is made 
precise through a concentration inequality. 

• The analysis on the whole matrix is reduced to the analysis of a matrix 
of much smaller dimensions, whose entries are the spectral norms of the 
blocks, through the application of a so-called norm compression inequality. 

• With high probability, large blocks in {£b : B E B}, which correspond to 
negligible parts of the true covariance matrix £, are all shrunk to zero 
because by construction they are necessarily far away from the diagonal. 

We shall elaborate below these main ideas in our analysis and introduce 
some useful technical tools. The detailed proof is relegated to Section 6. 
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(a)S(-,l) (b)5(-,2) 

Fig. 3. Decompose a matrix into the sum of matrices of different block sizes: S(-,l) on 
the left and S(-,2) on the right. All entries in the unshaded area are zero. 



3.1. Main strategy. Recall that B is the collection of blocks created using 
the procedure in Section 2.1, and it forms a partition of {1,2, . . . ,p} 2 . We 
analyze the error E — £ by first decomposing it into a sum of matrices such 
that each matrix in the sum only consists of blocks in B that are of the same 
size. More precisely, for a px p matrix A, define S(A; I) to be a p x p matrix 
whose entry equals that of A if belongs to a block of dimension 
2 l ~ 1 ko, and zero otherwise. In other words, 

BeB:d(B)=2 l - 1 k 

With this notation, E — £ is decomposed as 

£ - £ = 5(£ - £, 1) + S(£ - £, 2) + • ■ • . 

This decomposition into the sum of blocks of different sizes is illustrated in 
Figure 3 below. 

We shall first separate the blocks into two groups, one for big blocks and 
another for small blocks. See Figure 4 for an illustration. By the triangle 
inequality, for any L > 1 , 

(3.2) ||£-E||<^||S(£-£,0||+ £S(Il-£,0 

KL 1>L 

The errors on the big blocks will be bounded as a whole, and the errors on 
the small blocks will be bounded separately according to block sizes. With 
a careful choice of the cutoff value L, it can be shown that there exists a 
constant c> not depending on ti and. p such that for any o. > and S £ Cai 

(3.3) E^||5(E - £,Z)|| V = cmin{n- 2a /( 2a+1 ) + £ j, 
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(a) Small blocks 



(b) Large blocks 



Fig. 4. Small blocks and large blocks are treated separately. Small blocks are necessarii 
close to the diagonal and large blocks are away from the diagonal. 



and 

(3.4) E J^5(S-S,Z) 

1>L 

which then implies Theorem 3.1 because 



cmm< n 



-2a/(2a+l) 



+ 



logp p 



n n 



E||E-E|| 2 <2e(^||S'(£-E,Z)||') + 2E ^S*(S-£ 

KL ' 1>L 



,1) 



The choice of the cutoff value L depends on p and n and different ap- 
proaches are taken to establish (3.3) and (3.4). In both key technical 
tool we shall use is a concentration inequality on the deviation of a block 
of the sample covariance matrix from its counterpart of the true covariance 
matrix, which we now describe. 

3.2. Concentration inequality. The rationale behind our block thresh- 
olding approach is that although the sample covariance matrix £ is not a 
reliable estimator of E, its submatrix, E#, could still be a good estimate 
of Tib- This observation is formalized in the following theorem. 

Theorem 3.3. There exists an absolute constant cq > such that for 
allt>\, 



>( p| [||E b -E b || < C ot v / ||S/x/||||E JxJ | 



d(B) + log p 



n 



In particular, we can take cq = 5.44. 
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Theorem 3.3 enables one to bound the estimation error E — E block by 
block. Note that larger blocks are necessarily far away from the diagonal 
by construction. For bandable matrices, this means that larger blocks are 
necessarily small in the spectral norm. From Theorem 3.3, if Ao > Co, with 
overwhelming probability, 



|Eb|| < ||E B || +cq\/||Ej x /||||E 



JxJ| 



d{B) + \ogp 



n 



<AoVl|S/x/||||S 



JxJ| 



d{B) +\ogp 



n 



for blocks with sufficiently large sizes. As we shall show in Section 6, ||E/ X /|| 
and ||E j x j\\ in the above inequality can be replaced by their respective sam- 
ple counterparts. This observation suggests that larger blocks are shrunken 
to zero with our proposed block thresholding procedure, which is essential 
in establishing (3.4). 

The treatment of smaller blocks is more complicated. In light of Theo- 
rem 3.3, blocks of smaller sizes can be estimated well, that is, E# is close 
to for B of smaller sizes. To translate the closeness in such a blockwise 
fashion into the closeness in terms of the whole covariance matrix, we need 
a simple yet useful result based on a matrix norm compression transform. 

3.3. Norm compression inequality. We shall now present a so-called norm 
compression inequality which is particularly useful for analyzing the proper- 
ties of the block thresholding estimators. We begin by introducing a matrix 
norm compression transform. 

Let Abeapxp symmetric matrix, and let p±, . . . ,pc be positive integers 

such that pi H \~Pg = P- The matrix A can then be partitioned in a block 

form as 

/A u A 12 ... A 1G \ 
A21 A 2 2 ■ ■ ■ A 2 g 



\A G1 A G2 



Ago/ 



where Aij is a pi x pj submatrix. We shall call such a partition of the ma- 
trix A a regular partition and the blocks A^ regular blocks. Denote by 



5Xp 



a norm compression transform 
/Pull M12II 

11^-21 1| 11^-22 || 



A^M(A; Pl ,...,p G ) 



PigII \ 

\\A2G\\ 



\\\A G1 \\ \\A G2 \\ ... \\Agg\\J 
The following theorem shows that such a norm compression transform does 
not decrease the matrix norm. 



ADAPTIVE COVARIANCE MATRIX ESTIMATION 



13 



Theorem 3.4 (Norm compression inequality). For any p x p matrix A 
and block sizes Pi,P2, ■ ■ ■ ,PG such that p\ + • • • + pc = p, 

\\A\\<\\M{A-p u ...,p G )\\. 

Together with Theorems 3.3 and 3.4 provides a very useful tool for bound- 
ing S{Ti — £,/). Note first that Theorem 3.4 only applies to a regular parti- 
tion, that is, the divisions of the rows and columns are the same. It is clear 
that S(-,l) corresponds to regular blocks of size ko x ko with the possible 
exception of the last row and column which can be of a different size, that 
is, pi =P2 = ■ ■ ■ = ko. Hence, Theorem 3.4 can be directly applied. However, 
this is no longer the case when I > 1. 

To take advantage of Theorem 3.4, a new blocking scheme is needed for 
S(-,l). Consider the case when 1 = 2. It is clear that S(l,2) does not form a 
regular blocking. But we can form new blocks with p\ =pi = ■ ■ ■ = ko, that 
is, half the size of the original blocks in S(-,2). Denote by the collection 
of the new blocks B' . It is clear that under this new blocking scheme, each 
block B of size 2ko consists of four elements from B' . Thus 

S(A,2)= AoI((i,j)eB)= AoI((i,j)eB>). 

bgb B'eB' 

d(B)=2k 3BeB such that d(B)=2k 

and B'CB 

Applying Theorem 3.4 to the regular blocks B' yields 

\\S(A,2)\\ <\\M(S(A,2);k ,. . . ,k )\\, 

which can be further bounded by 

\\M(S(A,2);k ,...,k )\\ ei , 

where || • stands for the matrix i\ norm. Observe that each row or column 
of Af(S(A, 2); ko, . . . , ko) has at most 12 nonzero entries, and each entry is 
bounded by 

max II^WII < max 1 1 ^4 ^ 1 1 

B'eB' BeB 

BBGB such that d(B)=2k d(B)=2k 
and B'CB 

because B' C B implies ||Ab'|| < II^bII- This property suggests that HS^E — 
£,£)|| can be controlled in a block-by-block fashion. This can be done using 
the concentration inequalities given in Section 3.2. 

The case when I > 2 can be treated similarly. Let P2j-i = (2 l ~ l — 3)ko and 
P2j = 3/co for j = 1, 2, .... It is not hard to see that each block B in B of size 
2 l ~ 1 ko occupies up to four blocks in this regular blocking. And following the 
same argument as before, we can derive bounds for S(A,l). 

The detailed proofs of Theorems 3.1 and 3.2 are given in Section 6. 
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4. Numerical results. The block thresholding estimator S proposed in 
Section 2 is easy to implement. In this section we turn to the numerical 
performance of the estimator. The simulation study further illustrates the 
merits of the proposed block thresholding estimator. The performance is 
relatively insensitive to the choice of ko, and we shall focus on fco = |1°§pJ 
throughout this section for brevity. 

We consider two different sets of covariance matrices. The setting of our 
first set of numerical experiments is similar to those from Cai, Zhang and 
Zhou (2010). Specifically, the true covariance matrix £ is of the form 



where the value of p is set to be 0.6 to ensure positive definiteness of all 
covariance matrices, and independently sampled from a uniform 

distribution between and 1. 

The second settings are slightly more complicated, and the covariance 
matrix S is randomly generated as follows. We first simulate a symmetric 
matrix A = (a,j) whose diagonal entries are zero and off-diagonal entries 
a ij (* < j) are independently generated as ~ N(0, \i — j|~ 4 ). Let X m i n (A) 
be its smallest eigenvalue. The covariance matrix S is then set to be S = 
max(0, — l.lA m i n (^4))J + A to ensure its positive definiteness. 

For each setting, four different combinations of p and n are considered, 
(n,p) = (50, 50), (100, 100), (200, 200) and (400,400), and for each combina- 
tion, 200 simulated datasets are generated. On each simulated dataset, we 
apply the proposed block thresholding procedure with Ao = 6. For com- 
parison purposes, we also use the banding estimator of Bickel and Levina 
(2008a) and tapering estimator of Cai, Zhang and Zhou (2010) on the simu- 
lated datasets. For both estimators, a tuning parameter k needs to be chosen. 
The two estimators perform similarly for the similar values of k. For brevity, 
we report only the results for the tapering estimator because it is known to 
be rate optimal if k is appropriately selected based on the true parameter 
space. It is clear that for both our settings, £ 6 C a with a = 1. But such 
knowledge would be absent in practice. To demonstrate the importance of 
knowing the true parameter space for these estimators and consequently the 
necessity of an adaptive estimator such as the one proposed here, we ap- 
ply the estimators with five different values of a, 0.2,0.4,0.6,0.8 and 1. We 
chose k = [n 1 /( 2a+1 ' J for the tapering estimator following Cai, Zhang and 
Zhou (2010). The performance of these estimators is summarized in Figures 5 
and 6 for the two settings, respectively. 

It can be seen in both settings that the numerical performance of the 
tapering estimators critically depends on the specification of the decay rate 
a. Mis-specifying a could lead to rather poor performance by the tapering 




1 < i = j < P, 
l<i^j<P, 
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Fig. 5. Comparison between the tapering and adaptive block thresholding estima- 
tors — simulation setting 1: each panel corresponds to a particular combination of sample 
size n and dimension p. In each panel, boxplots of the estimation errors, measured in terms 
of the spectral norm, are given for the block thresholding estimator with Ao = 6 and the 
tapering estimator with a = 0.2, 0.4,0.6,0.8 and 1. 



estimators. It is perhaps not surprising to observe that the tapering estima- 
tor with a = 1 performed the best among all estimators since it correctly 
specifies the true decay rate and therefore, in a certain sense, made use of 
the information that may not be known a priori in practice. In contrast, the 
proposed block thresholding estimator yields competitive performance while 
not using such information. 

5. Discussion. In this paper we introduced a fully data-driven covariance 
matrix estimator by blockwise thresholding of the sample covariance matrix. 
The estimator simultaneously attains the optimal rate of convergence for 
estimating bandable covariance matrices over the full range of the parameter 
spaces C a for all a > 0. The estimator also performs well numerically. 

As noted in Section 2.2, the choice of the thresholding constant Ao = 6 is 
based on our theoretical and numerical studies. Similar to wavelet thresh- 
olding in nonparametric function estimation, in principle other choices of Ao 
can also be used. For example, the adaptivity results on the block thresh- 
olding estimator holds as long as Aq > 5.44 (= \/24/(l — 2e~ 3 )) where the 
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Fig. 6. Comparison between the tapering and adaptive block thresholding estima- 
tors — simulation setting 2: each panel corresponds to a particular combination of sample 
size n and dimension p. In each panel, boxplots of the estimation errors, measured in terms 
of the spectral norm, are given for the block thresholding estimator with Ao = 6 and the 
tapering estimator with a = 0.2, 0.4,0.6,0.8 and 1. 



value 5.44 comes from the concentration inequality given in Theorem 3.3. 
Our experience suggests the performance of the block thresholding estima- 
tor is relatively insensitive to a small change of Ao- However, numerically 
the estimator can sometimes be further improved by using data-dependent 
choices of Ao- 

Throughout the paper, we have focused on the Gaussian case for ease of 
exposition and to allow for the most clear description of the block thresh- 
olding estimator. The method and the results can also be extended to more 
general subgaussian distributions. Suppose that the distribution of the X^'s 
is subgaussian in the sense that there exists a constant a > such that 

(5.1) F{\v T (X-KX)\ >t} <e~ t2/2a2 for all t > and ||v|| = 1. 

Let F a (a, Mq , M) denote the collection of distributions satisfying both (1.1) 
and (5.1). Then for any given ao > 0, the block thresholding estimator £ 
adaptively attains the optimal rate of convergence over T a Mo , M) for 
all a, Mq,M > and < a < ctq whenever Aq is chosen sufficiently large. 
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In this paper we have focused on estimation under the spectral norm. The 
block thresholding procedure, however, can be naturally extended to achieve 
adaption under other matrix norms. Consider, for example, the Frobenius 
norm. In this case, it is natural and also necessary to threshold the blocks 
based on their respective Frobenius norms instead of the spectral norms. 
Then following a similar argument as before, it can be shown that this 
Frobenius norm based block thresholding estimator can adaptively achieve 
the minimax rate of convergence over every C a for all a > 0. It should also be 
noted that adaptive estimation under the Frobenius norm is a much easier 
problem because the squared Frobenius norm is entrywise decomposable, 
and the matrix can then be estimated well row by row or column by column. 
For example, applying a suitable block thresholding procedure for sequence 
estimation to the sample covariance matrix, row-by-row would also lead to 
an adaptive covariance matrix estimator. 

The block thresholding approach can also be used for estimating sparse 
covariance matrices. A major difference in this case from that of estimating 
bandable covariance matrices is that the block sizes cannot be too large. 
With suitable choices of the block size and thresholding level, a fully data- 
driven block thresholding estimator can be shown to be rate-optimal for 
estimating sparse covariance matrices. We shall report the details of these 
results elsewhere. 

6. Proofs. In this section we shall first prove Theorems 3.3 and 3.4 and 
then prove the main results, Theorems 3.1 and 3.2. The proofs of some 
additional technical lemmas are given at the end of the section. 

6.1. Proof of Theorem 3.3. The proof relies the following lemmas. 

Lemma 1. Let A he a 2 x 2 random matrix following the Wishart dis- 
tribution W(n, Aq) where 



P(|Ai2 - p\ > x) < 2F(\W n -n\> nx), 

where W n ~ Xn ■ 

Proof. Let Z = {Zi,Z 2 ) T ~ N(0,A ) and Z^\...,Z^ be n indepen- 
dent copies of Z. Let 




Then 
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Note that 



(i) 7 (i)\2 



i=l 



t=l 



Therefore, 

n\si2-p\>x) 



< 



(^E((^ ) + z ? ) ) 2 -2(i+p)) 



>2(l + p)x 



+ 



£((Z«-Z«) 2 -2(l-p)) 



i=l 



>2(l-p)x . 



Observe that 



£((Z« + Z») 2 -2(l + p)) 



i=i 



>2(l + p)x 



A (Z{° + Z^) 2 
■P(|W„-n| >s). 



71 



> X 



Similarly, 



i^((Z«-Z») 2 -2(l-p)) 



i=l 



> 2(1 -p)x) = P(\W n -n\> x). 



The proof is now complete. □ 



Lemma 2. Let B = I x Jc [l,p] . There exists an absolute constant 
cq > suc/i i/iai for any t>l, 



\\Z B ~ Sb|| < coVI|S/x/||||Sjxj| 
In particular, we can take cq = 5.44. 



d(B)+\ogp 



n 



>l-p 



-6t 2 



Proof. Without loss of generality, assume that card(I) = card(J) = 
d{B) = d. Let A be a d x d matrix, ui,U2 and vi, V2 £ S d ~ l where is 
the unit sphere in the d dimensional Euclidean space. Observe that 

|u|*Avi| — |U 2 T " J 4V2| < |u]"j4vi — I1JAV2I 

= |u]"^4(vi - v 2 ) + (ui - u 2 ) T ^v 2 | 
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< |u]A(vi - v 2 )| + |(ui - u 2 ) T Av 2 | 

< ||ui||||A||||vi - V 2 || + ||ui - TJ.2 ] [ [ | ^4. [|]| V2 1 1 

= P||(||vi - V 2 || + ||ui -U 2 ||), 

where as before, we use || • || to represent the spectral norm for a matrix 
and £2 norm for a vector. As shown by Boroczky and Wintsche [(2005), e.g., 
Corollary 1.2], there exists an 5-cover set Q d C of such that 

card(Q d ) < ^V/ 2 i og (l + dcos 2 5) « c5~ d d 3 / 2 log(l + d) 
sin 5 

for some absolute constant c > 0. Note that 

(6.1) \\A\\= SU P u T Av< sup u T Av + 25||yl||. 

u.ves^- 1 u,veQ d 

In other words, 

(6.2) ||A|| < (1 - 26)" 1 sup u T iv. 

Now consider A = - ^B- Let Xi = (Xi : i G I) T and Xj = (A^ : i G J) T . 
Then 

1 n 

^B = ^(X^-X I )(X^-Xj) T , 

where 



X I = {X i :i£l) T and A/ = (A; : i G J) T . 
Similarly, £ B = E(Aj - EX»(Jf j - EAj) T . Therefore, 
1 n 

A = - ^2(X ( j l \x ( j ) ) T - EA/Aj) - (AjAj - EA/EAj). 
71 i=l 

Clearly the distributional properties of A are invariant to the mean of X. 
We shall therefore assume without loss of generality that EX = in the rest 
of the proof. 

For any fixed u, v G S^ 1 , we have 

1 n 



n . 
i=i 

where Fi = u T A 7 , F 2 = v T Aj, and similarly, = u T xf , Y® = v T ' xf . 
It is not hard to see that 

Y 1 \ ^ N f f u T Y, IxI u u T S/ xJ v 
Y 2 J V ' \v T Sj x /u v T Sj x jv 

and u T Av is simply the difference between the sample and population co- 
variance of Y\ and Y 2 . We now appeal to the following lemma: 
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Applying Lemma 1, we obtain 



|u T ylv| >x}< 2F(\W n -n\> — ^ —) , 

V ((u 1 £/ x /u)(v' SjxjvjjVV 



where W n ~ Xn- By the tail bound for y 2 random variables, we have 

nx \ ( nx 2 

((u T S 7x/ u)( V TS JxJ v))V2j - eXp l v "4||S /x/ ||||S JxJ 

See, for example, Lemma 1 of Laurent and Massart (2000). In summary, 

P{|u T ^v| >x}< 2expf- ^ -) . 

V 4 ll^/x7"|| II^Jx j\\ J 

Now an application of union bound then yields 
P(||E b -Sb||>x)<p{ sup u T Av>(l-25)x\ 

n{l-25) 2 x 2 



<2card(Q,j) exp 



4 E/xj EjxJ 



< cS- 2d d 3 log 2 (l + d) exp ( n(1 " 25)2x2 



4 S/xi IIIS 



JxJ| 



for some constant c > 0. In particular, taking 



x = c £\/ll s /x.HII|Sjxj 



<i + logp 



n 

yields 

c o* 2 ft 



F(\\^ B -^ B \\>x)<c6- Zd d- i log z (l + d)exp[ -JL-(i-25y(d + logp) ). 
Let 5 = e -3 and 



C0> T^ = 5 - 44 - 

Then 

P(||E B -S B || >x)<p- 6t \ □ 

We are now in position to prove Theorem 3.3. It is clear that the total 
number of blocks can be upper bounded by card(£>) < (p/k^) 2 < p 2 . It follows 
from the union bound and Lemma 2 that 

pj |J -S^ll > cotdl^x/llllSjxjH) 1 / 2 ^- 1 ^^) + logp)) 1/2 | 
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< ]T P{||Eb - Esll > cotdlS.x.llllSjxjH) 1 / 2 ^ 1 ^) + logp)) 1/2 } 
BeB 



< 



-6t 2 +2 



6.2. Proof of Theorem 3.4- Denote by u, v the left and right singular 
vectors corresponding to the leading singular value of A, that is, u T Av = 
Let u = (ui, . . . , ug) t and v = (vi, . . . , vc) T be partitioned in the same 
fashion as X, for example, u g , v 9 G W 9 . Denote by u* = (||ui||, . . . , ||ug'||) t 
and = (||vi||, . . . , ||vg||) t . It is clear that ||u*|| = ||v*|| = I. Therefore, 



G 

\\N{A)\\>v£N{A)v* 



/ J n u illll v A;||||^jfc|| 
j,k=l 

G 

> ^2 ujE ife v fc = u T Ev = \\A\\. 

j,k=i 

6.3. Proof of Theorem 3.1. With the technical tools provided by The- 
orems 3.3 and 3.4, we now show that E is an adaptive estimator of E as 
claimed by Theorem 3.1. We begin by establishing formal error bounds on 
the blocks using the technical tools introduced earlier. 

6.3.1. Large blocks. First treat the larger blocks. When E G C a , large 
blocks can all be shrunk to zero because they necessarily occur far away 
from the diagonal and therefore are small in spectral norm. More precisely, 
we have: 

Lemma 3. For any B £ B with d{B) > 2ko, 

||E B || < Md(B)- a . 

Together with Theorem 3.3, this suggests that 
I|Eb|| < II Eb — Eb|| + ||Eb|| 

< codlEzx/IIHEjxjII) 1 / 2 ^- 1 ^) + logp)) 1/2 + Md(B)- a , 
with probability at least 1 — p~ 4 . Therefore, when 

(6.3) ^)> C min|n^ + i),^ j 
for a large enough constant c > 0, 

(6.4) ||SB||<i(co + Ao)(||S 7x/ ||||E, /xJ ||) 1/2 (n- 1 (d( J B) + logp)) 1/2 . 

The following lemma indicates that we can further replace ||E/ X /|| and 
||Ej x j|| by their respective sample counterparts. 
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LEMMA 4. Denote by 1 = {I : I x J G £> } . Then for all I el, 

i y/cardCQ + t < p/xjjj < 1 | Vcard(/) + t 
~ l|E/ x /|| ~ \fn 
with probability at least 1 — Ap 2 exp(— 1 2 /2). 

In the light of Lemma 4, (6.4) implies that, with probability at least 
1 — 2p~ 4 , for any B GS such that d(B) < ra/logn and (6.3) holds, 

HSBl^AodlS/x/llllSjxjID^^n-^^ + logp)) 1 / 2 , 

whenever n/logp is sufficiently large. In other words, with probability at 
least 1 — 2p~ 4 , for any B 6 B such that (6.3) holds, E# = 0. 

6.3.2. Small blocks. Now consider the smaller blocks. From the discus- 
sions in Section 3.3, we have 

(6.5) IISTS-E.Z)!! < 12 max ||E B -S S ||. 

B&B:d(B)=2 l ~ 1 k 

Observe that by the definition of E, 
||Eb — Eb|| < ||Eb — Eb|| + ||E_b — Eb|| 

< AodlE/x/IHlEjxjII) 1 / 2 ^ 1 ^) + logp)) 1/2 + ||S fl - E B ||. 

By Lemma 4, the spectral norm of E/ x / and Ej x j appeared in the first 
term on the rightmost-hand side can be replaced by their corresponding 
population counterparts, leading to 

||E B - Efl|| < AodlS/x/llllEjxjH) 172 ^ 1 ^) +logp)) 1/2 + ||E B - E B || 

< X Q M Q {rr\d{B) + log p)) 1/2 + || E B - E B ||, 

where we used the fact that ||E/ X /||, ||Ej x j|| < Mq. This can then be readily 
bounded, thanks to Theorem 3.3: 

||S S - E B || < (A M + c ){n- l {d(B) + logp)) 1/2 . 
Together with (6.5), we get 

(6.6) ||5(E - E,0|| < C{n-\k 2 1 - 1 + logp)) 1/2 . 

6.3.3. Bounding the estimation error. To put the bounds on both small 
and big blocks together, we need only to choose an appropriate cutoff L 
in (3.2). In particular, we take 

r |Tog 2 (p/fco)l, ifp<n x /(2«+i), 

(6.7) L = I |Tog 2 (n 1 / 2o,+1 /Ao)l > if l °SP < n 1/(2a+1) and n x /( 2a+1 ) < p, 

[ riog 2 (logp/fc )l , if n 1 /( 2Q+1 ) < logp, 

where \x] stands for the smallest integer that is no less than x. 
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Small p. If p < n l ^ 2a+l ^ , all blocks are small. From the bound derived 
for small blocks, for example, equation (6.6), we have 



\t - S|| < £||S(E - S, Oil < C^n" 1 ^" 1 **, + \ogp)) 1/2 < Cip/n) 1 



1/2 



with probability at least 1 — 2p~ A . Hereafter we use C > as a generic 
constant that does not depend on p, n or a, and its value may change at 
each appearance. Thus 



E||S - S|| 2 I{||S - S|| < C{p/n) 1/2 } 
+ E||S-S|| 2 I{||S-S|| >C(p/n) 1/2 } 



It now suffices to show that the second term on the right-hand side is 0{p/n). 
By the Cauchy-Schwarz inequality, 

E||S - S|| 2 I(||S - S|| > Cip/n) 1 / 2 ) 

< (E||£ - £|| 4 P{||£ - S|| > C{p/n) 1/2 }) 1/2 
-4m Wr* V l|4\l/2 



Observe that 



< (2p- 4 E||S-S| 



E||S - S|| 4 < E||S - EH 4 , < Cp 4 /n 2 , 



where || • ||p stands for the Frobenius norm of a matrix. Thus, 
E||S - S|| 2 I{||S - E|| > C(p/n)^ 2 } < Cp/n. 

Medium p. When logp < n l ^ 2a+l ' > and n 1 ^ 2a+1 ^ <p, by the analysis 
from Section 6.3.1, all large blocks will be shrunk to zero with overwhelming 
probability, that is, 

p{^5(S,/) = o}>l-2^ 4 . 

When this happens, 



1>L 



1>L 



1>L 



Recall that || • stands for the matrix l\ norm, that is, the maximum row 
sum of the absolute values of the entries of a matrix. Hence, 



]TS(£-£,Z) 

1>L 



< ML~ a < Cn~ a ^ 2a+1 \ 
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As a result, 
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E 



1>L 



E 



£S(E-E,Z) l|^,S(S,Z) = o| 
^S(£-£,Z) l{^S(E,0^o). 

1>L N>L ^ 



1>L 

+ E 



It remains to show that 



E 



]TS(£-E,Z) l|^5(S,0/o|=0(n 



-2a/(2a+l) 



)• 



By the Cauchy-Schwarz inequality, 

2 

E<j 

Z>L 



^S(£-£,Z) i^5(s,o/o)| 

^S(£-£,Z) pj^S(£,/)/oj) 



< E 



1/2 



Observe that 

4 

^5(S-S,0 < ]Ts(E-£,Z) 



< 



< 2 



£S(E-E,Z) ) 

1>L 

J>(£-£,z) + J>(£,z) ) 

1>L F />L F/ 

4 4\ 

^S(£-£,Z) + JS(S,i) , 

F F/ 



where the second inequality follows from the fact that E = E or 0. It is not 
hard to see that 



E 



J>(£-£,Z) 



1>L 

On the other hand, 

4 



<E||E-E||£<C//n 2 . 



Z>L 



\,i:|i-j|>fe 2 i - 1 
<Cn -4a/(2a+l)_ 



4-1 ^ 



,i:|i-i|>fco2 i - 1 
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Therefore, 



< C{p 4 /n 2 + n 



E 5^5(S-S,Z) 

1>L 

Together with Theorem 3.3, we conclude that 

2 

1>L 



-4a/(2o+l)N 



Large p. Finally, when p is very large in that logp > n l ^ 2a+l \ we can 
proceed in the same fashion. Following the same argument as before, it can 
be shown that 

2 



E 



1>L 



<C(n 1 \ogp). 



The smaller blocks can also be treated in a similar fashion as before. From 
equation (6.6), 



KL 



with probability at least 1 — 2p 4 . Thus, it can be calculated that 



E 



<C(n 1 logp). 



Combining these bounds, we conclude that E||£ - £|| 2 < C(n _1 log p). In 
summary, 

-2a/(2a+l) + ^SP P 



sup E||S-Sf <C min< n 



' f ' 

n n 



for all a > 0. In other words, the block thresholding estimator S achieves 
the optimal rate of convergence simultaneously over every C a for all a > 0. 

6.4. Proof of Theorem 3.2. Observe that 

E||fi - 0|| 2 = E(||0 - 0|| 2 I{A min (S) > ±A min (£)}) 

+ E(||fi - fi|| 2 I{A min (S) < ±A min (£)}), 

where A m i n (-) denotes the smallest eigenvalue of a symmetric matrix. Under 
the event that 

•^min(S) > ^A m j n (S), 

£ is positive definite and Cl = Note also that 

US" 1 - S" 1 !! = \\t~ l (t - xoir 1 !! < -s||||s _1 ||. 
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Therefore, 

E( \\h - n\\ 2 l\ X min {t) > -A min (£)l ) <4||0|| 2 E||E-£| 



I n n J 



by Theorem 3.1. On the other hand, 

E(||fi - 0|| 2 I{A min (S) < ±A min (£)}) 



< E((||0|| + ||0||) 2 I{A min (S) < ±A min (£)}) 

\2t, 



< (n+ ||ft||) 2 P{A min (£) < ±A min (£)}. 
Note that 

P{A min (£) < ±A min (E)} < P{||E - E|| > ±A min (E)}. 
It suffices to show that 

n 2 p{||£ - E|| > ^A min (E)l < Cmin{n- 2 -/( 2 «+ 1 ) + ^A- 

Consider first the case when p is large. More specifically, let 

p>n(48A M 2 )" 2 . 

As shown in the proof of Theorem 3.1, 

P{||S-S||>iA min (S)}<4p- 4 . 

It is not hard to see that this implies the desired claim. 
Now consider the case when 

P<n(48A M 2 )" 2 . 

Observe that for each B = I x J G B, 

||E B - E B || < AodlS/x/IHlSjxjII) 1 / 2 ^- 1 ^) + logp)) 1/2 

<Xo\m\(n~ 1 (d(B) + logp)) 1/2 . 

It can then be deduced from the norm compression inequality, in a similar 
spirit as before, that 

||S-S||<^||5(S-E,0|| 
l 

< 12A ||S|| ^(n- 1 (2 i ~ 1 A; + logp)) 1/2 
l 

<12A ||e||(p/»i) 1/2 . 

By the triangle inequality, 

||E - E|| < ||E-E|| + ||E-E||, 
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and 

||S|| < + ||S||. 

Under the event that 

„v ^11 ^ (l/2)A min (S) - 12A (p/n) 1 / 2 A max (S) 1 
1 + 12Ao(p/n) i /" : 5 

we have 

||S — E|| > ^A m ; n (S). 

Now by Lemma 2, 

> iA min (S)| <p|||S-S|| > lA min (S)| <exp(-^isgl 

for some constant c > 0, which concludes the proof. 

6.5. Proof of Lemma 3. The proof relies on the following simple obser- 
vation. 

Lemma 5. For any B G B with dimension d(B) > 4/co, 

min \i — j\ > d(B). 

Proof. Note that for any B £ B, there exists an integer r > such that 
d(B) = 2 r ~ 1 fco- We proceed by induction on r. When r = 3, it is clear by 
construction, blocks of size 4/co x 4/co are at least one 2ko x 2ko block away 
from the diagonal. See Figure 2 also. This implies that the statement is true 
for r = 3. From r + 1 to r + 2, one simply observes that all blocks of size 
2 r+1 ko x 2 r+1 ko is at least one 2 r /co x 2 r ko block away from blocks of size 
2 r - 1 k x 2 r ~ 1 k . Therefore, 

min \i- j\ >2 r k + 2 r k = 2 r+1 k , 
(»j')eB 

which implies the desired statement. □ 

We are now in position to prove Lemma 3 which states that big blocks of 
the covariance matrix are small in spectral norm. Recall that the matrix i\ 
norm is defined as 

v 

\A\i x = sup H-AxH^j = max > \<kj\i 

xgIRP:||x|| fl =l ^\t^ 

for an m x n matrix A = {aij))x<i< m ,l<j<n- Similarly the matrix (.^ norm 
is defined as 



xgRP 



It 

SUp ||^ x IUoo = max / \ a i 
.»„» _i °° Ki<m — ' 
■IMUoo- 1 1 = 1 
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It is well known [see, e.g., Golub and Van Loan (1996)] that 



|^|| 2 < PlkPk 



Immediately from Lemma 5, we have 

H^slk, II^bII^ < max V] \(Tij\ < Md(B)~ a , 

l<i<p 1 — ' 

j:|j-«l>2 r A;o 

which implies ||S_b|| < Md(B)~ a . 

— 1/2 

6.6. Proof of Lemma 4- For any I £ I, write Zj = S Jx j Y. Then the en- 
tries of Z[ are independent standard normal random variables. From the con- 
centration bounds on the random matrices [see, e.g., Davidson and Szarek 
(2001)], we have 



y^ardlQ+t i /2 - 1/2 = ^card(/)+t 

1 S A minl^^l) ^ A max(^J < 1 + "/= 

with probability at least 1 — 2exp(— 1 2 /2) where is the sample covari- 
ance matrix of Zj . Applying the union bound to all I El yields that with 
probability at least 1 — 2p 2 exp(— i 2 /2), for all I 



Vcard(/)+t 1/2 N <x l/2/y y/caid(I)+t 
1 ^ Kin^Zj) < X^A^Zj) < 1 + ^= • 

Observe that S/ x / = Ej^EzjSj^. Thus 

A mm(Sz 7 )A max (S/ x /) < A max (S/ x /) < A max (Sz / )A max (S/ x /), 

which implies the desired statement. 
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